[Transformations][MOE] Add MOE internal op and fuse vectorized MatMul experts into MOE #32183
Conversation
OV_OP_SCOPE(internal_MOE_validate_and_infer_types);
// TODO: Add inputs validation

set_output_type(0, get_input_element_type(0), get_input_partial_shape(0));
We can also use the shape of the weights to deduce dimension sizes if some of them are unknown in the input hidden_state.
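A minimal sketch of that idea, assuming the input order from the op's doc comment (hidden_states at port 0, w0_weight at port 3) and that hidden_size is the trailing dimension of both tensors; class and namespace names are illustrative, not the actual implementation:

```cpp
void ov::op::internal::MOE::validate_and_infer_types() {
    OV_OP_SCOPE(internal_MOE_validate_and_infer_types);

    // Output keeps the element type and shape of hidden_states (port 0).
    auto out_shape = get_input_partial_shape(0);         // [..., hidden_size]
    const auto& w0_shape = get_input_partial_shape(3);   // [num_experts, ..., hidden_size]

    // If hidden_size is dynamic in hidden_states, borrow it from the expert weights.
    if (out_shape.rank().is_static() && w0_shape.rank().is_static() &&
        out_shape.size() > 0 && w0_shape.size() > 0) {
        auto& hidden_size = out_shape[out_shape.size() - 1];
        const auto& w_hidden_size = w0_shape[w0_shape.size() - 1];
        if (hidden_size.is_dynamic() && w_hidden_size.is_static()) {
            hidden_size = w_hidden_size;
        }
    }

    set_output_type(0, get_input_element_type(0), out_shape);
}
```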
When you merge this transformation, please disable it in the GPU plugin until we support the MoE subgraph there, to prevent crashes when using the default packages.
/// (input to final multiplication)
/// 2: router_topk_output_indices - [..., topk] indices of selected top-k experts
/// 3: w0_weight - expert weights for first projection, shape [num_experts, inter_size, hidden_size] or
///               [num_experts, hidden_size, 2 * inter_size] if fused
I think [num_experts, hidden_size, 2 * inter_size] will be transposed to [num_experts, 2 * inter_size, hidden_size].
If there is a case where the weights are not transposed, we'll need a flag indicating whether the weight is transposed or not.
Or, if we can assume that the fused MoE always has the weights transposed, that would be best.
As we discussed, it will be adjusted to reflect MatMul(transpose_a=False, transpose_b=True);
related PR:
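For reference, a small sketch of that convention, assuming the fused gate/up weight is stored as [num_experts, 2 * inter_size, hidden_size] so the transpose is folded into the MatMul; the concrete sizes are only illustrative:

```cpp
#include <memory>

#include "openvino/op/matmul.hpp"
#include "openvino/op/parameter.hpp"

std::shared_ptr<ov::Node> make_fused_up_gate_example() {
    // hidden_states broadcast over experts: [num_experts, tokens, hidden_size]
    auto hidden = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f32,
        ov::PartialShape{ov::Dimension::dynamic(), ov::Dimension::dynamic(), 4096});

    // Fused gate/up weight, already transposed: [num_experts, 2 * inter_size, hidden_size]
    auto w0 = std::make_shared<ov::op::v0::Parameter>(
        ov::element::f32, ov::PartialShape{8, 2 * 14336, 4096});

    // transpose_b=true: [E, T, H] x [E, 2I, H]^T -> [E, T, 2I]
    return std::make_shared<ov::op::v0::MatMul>(hidden, w0,
                                                /*transpose_a=*/false,
                                                /*transpose_b=*/true);
}
```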
This transformation is not enabled by default; it should be enabled in each plugin with MOE support.
Details: In this PR we introduce yet another operation, "GatherMatmul", which essentially performs gemv operations over the current tokens and the active experts. As a first step, we perform the gemv operation using dnnl::inner_product. This solution is clearly suboptimal: it does not give fine-grained control over parallelization, and when many tokens are processed by a single expert (prefill), a gemm operation may be more efficient, since the tokens can be batched and SIMD-level parallelization can be applied across tokens as well. This PR also contains all the essential transformations that enable a few common MoE patterns. The MoE pattern matcher is based on #32183.
Related oneDNN fork PR: openvinotoolkit/oneDNN#292
Tickets: CVS-171910
Co-authored-by: Vladislav Golubev <[email protected]>
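A conceptual reference of the gather-then-gemv idea described in the quoted PR (not the plugin's dnnl-based implementation); the layouts, names, and routing format are assumptions for illustration:

```cpp
#include <cstddef>

// weights:  [num_experts][out_dim][in_dim], flattened, MatMul(transpose_b=true) layout
// tokens:   [num_tokens][in_dim]
// topk_ids: [num_tokens][topk] indices of the active experts per token
// out:      [num_tokens][topk][out_dim]
void gather_matmul_ref(const float* weights, const float* tokens, const int* topk_ids,
                       float* out, std::size_t num_tokens, std::size_t topk,
                       std::size_t in_dim, std::size_t out_dim) {
    for (std::size_t t = 0; t < num_tokens; ++t) {
        const float* x = tokens + t * in_dim;
        for (std::size_t k = 0; k < topk; ++k) {
            // Gather the weight matrix of the k-th expert selected for this token.
            const auto e = static_cast<std::size_t>(topk_ids[t * topk + k]);
            const float* w = weights + e * out_dim * in_dim;
            float* y = out + (t * topk + k) * out_dim;
            // One gemv per (token, active expert) pair.
            for (std::size_t o = 0; o < out_dim; ++o) {
                float acc = 0.f;
                for (std::size_t i = 0; i < in_dim; ++i)
                    acc += w[o * in_dim + i] * x[i];
                y[o] = acc;
            }
        }
    }
}
```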
Details:
This transformation runs at compile time and is not enabled by default; it should be enabled in each plugin with MOE support.
Example registration of the fusion transformation for the CPU plugin: 41145cf
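A minimal sketch of what such an opt-in registration could look like on the plugin side; the pass name and header path below are hypothetical, and the real registration for the CPU plugin is in the commit referenced above:

```cpp
#include <memory>

#include "openvino/core/model.hpp"
#include "openvino/pass/manager.hpp"
#include "transformations/op_conversions/fuse_vectorized_moe.hpp"  // hypothetical header path

void apply_moe_fusion(const std::shared_ptr<ov::Model>& model) {
    ov::pass::Manager manager;
    // Not part of the common pipeline: a plugin that supports MOE opts in explicitly.
    manager.register_pass<ov::pass::FuseVectorizedMOE>();  // hypothetical pass name
    manager.run_passes(model);
}
```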
MOE internal op spec PR:
Preliminary requirements (offline transformations):
The patterns match MatMul(transpose_a=False, transpose_b=True); for batched MatMuls, a preliminary update of MatMulConstTransposesExtraction is needed:
Fusion of separate MatMul experts into vectorized (batched) MatMul:
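To illustrate the weight side of that fusion, a sketch of stacking per-expert 2-D weights into one batched weight for a single MatMul(transpose_b=True); the helper name and shapes are illustrative, not the transformation's actual code:

```cpp
#include <memory>

#include "openvino/op/concat.hpp"
#include "openvino/op/constant.hpp"
#include "openvino/op/unsqueeze.hpp"

// Before: per expert e, MatMul(hidden [T, H], w_e [I, H], transpose_b=true) -> [T, I]
// After:  MatMul(hidden [1, T, H], W [E, I, H], transpose_b=true)           -> [E, T, I]
std::shared_ptr<ov::Node> stack_expert_weights(const ov::OutputVector& expert_weights /* each [I, H] */) {
    ov::OutputVector unsqueezed;
    auto axis0 = ov::op::v0::Constant::create(ov::element::i64, ov::Shape{1}, {0});
    for (const auto& w : expert_weights) {
        // [I, H] -> [1, I, H], so the experts can be concatenated along a new leading axis.
        unsqueezed.push_back(std::make_shared<ov::op::v0::Unsqueeze>(w, axis0));
    }
    return std::make_shared<ov::op::v0::Concat>(unsqueezed, 0);  // [E, I, H]
}
```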
Tickets: